#### Memory Management

Chapter 3: Section 3.1-3.3

#### Memory Management

- » Paraphrase of Parkinson's Law, "Programs expand to fill the memory available to hold them."
- » Average home computer nowadays has 10,000 times more memory than the IBM 7094, the largest computer in the world in the early 1960s

## Memory Hierarchy

- » How does the operating system create abstractions from memory, and how does it manage them?
- » Memory hierarchy:
  - » Few megabytes of very fast, very expensive, volatile cache memory
  - » A few gigabytes of medium-speed, medium priced, volatile main memory
  - » A few terabytes of slow, cheap, nonvolatile magnetic or solid state disk storage.

## Memory Hierarchy

- » Memory Manager manages it
  - » Tracks which parts of memory are in use
  - » Allocates memory to processes
  - » Deallocates memory when processes are done.

## No Memory Abstraction

- » Early mainframes (before 1960), early minicomputers (before 1970), early personal computers (before 1980) had no memory abstraction
- » Every program saw physical memory.
- » Programmers saw a set of addresses from 0 to some maximum.

#### No Memory Abstraction

- » Two programs could not be resident in memory at the same time.
  - » Address conflicts
  - » Could run multi-threaded since threads share address space.
  - » Limited use. Users want unrelated programs to be running at once.

## Memory Layout



## No Memory Abstraction

- » Multiple programs possible
  - » Save entire contents of memory to a disk file
  - » Bring in new and run it
  - » Swapping.
- » Additional hardware will allow concurrency without swapping.

## Switcher

- Switcher worked by designating a number of fixed "slots" in memory
  - Applications could be loaded into those slots
  - The user could then switch between these applications by clicking a button above the menu bar. The current application would horizontally slide out of view, and the next one would slide in.
  - Worked with existing memory management so changes were transparent to the OS and to the running programs



#### Switcher



by Andy Hertzfeld

Version 4.4 -- August 12, 1985 © 1985 Apple Computer, Inc.



#### Helpful Hints:

Use  $\mathfrak{H}[$  and  $\mathfrak{H}[$  to rotate between applications.

Use  $\Re$ \ to return back to the switcher.

Use the option key to transport the clipboard between applications (or not).
Use %-shift-option-period as an "emergency exit" to exit hung applications.
The Finder can be run under the Switcher; open Switcher to quit from the Finder.
Click on the screen of the Mac icon to toggle saving screen bits to save 22K.

Thanks to John Markoff and Bud Tribble.

http://www.folklore.org/StoryView.py? project=Macintosh&story=Switcher.txt

# Switcher Memory Layout

High Memory



Unallocated

## No Memory Abstraction

» IBM 360 solution: static relocation

» Extra info needed to specify what is relocatable

| is     |       |        |       |
|--------|-------|--------|-------|
| 0      | 16380 | 0      | 16380 |
| U      | 10300 | 0      | 10300 |
| :      |       | :      |       |
| ADD    | 28    | CMP    | 28    |
| MOV    | 24    |        | 24    |
|        | 20    |        | 20    |
|        | 16    |        | 16    |
|        | 12    |        | 12    |
|        | 8     |        | 8     |
|        | 4     |        | 4     |
| JMP 24 | 0     | JMP 28 | 0     |
| (a)    |       | (b)    |       |

CMP

**JMP 28** 

ADD

MOV

JMP 24

(c)

#### Address Spaces

- » Exposing memory to processes have several drawbacks
  - » If user programs can address every byte of memory, the OS can be trashed intentionally or by accident.
  - » Difficult to have multiple programs running at once

#### Address Spaces

- » Abstract memory
  - » Process creates a virtual CPU abstraction
  - » Address space is abstract memory for a process to live in
- » Address space: Set of all addresses that a process can see to address memory
- » Each process has an independent address space

#### Address Spaces

- » Need to map address 28 in one process and address 28 in another process to a separate physical address.
- » Dynamic relocation
- » Older solution: two special registers
  - » Base register
  - » Limit register

#### Physical v. Logical Addresses

- » Originally programs were compiled to reference addresses that corresponded one to one with physical memory addresses.
- » Unknowingly using concept of logical and physical memory
  - Logical addresses Set of addresses the CPU generates as the program is executed.
  - Physical addresses Set of addresses used to reference physical memory.

#### Dynamic Relocation



Physical address

## Dynamic Relocation



## Swapping

- » If physical memory is large enough to hold all the processes, the previous schemes will do
- » In practice, total amount of RAM is more that can fit in memory.
- » Two approaches:
  - » Swapping: Bring in each process in its entirety, run it for a while, then put it back onto disk
  - » Virtual Memory: Run processes when they are partially in memory

## Swapping



#### Heap Fragmentation

#### Heap Space

unused 16-byte chuck

chunk for variable D

unused 16-byte chuck

chunk for variable C

unused 16-byte chuck

chunk for variable E

unused 16-byte chuck

chunk for variable B

unused 16-byte chuck

chunk for variable A

unused 16-byte chuck

96 bytes of free memory but we can't allocate any chuck larger than 16 bytes

External Fragmentation

#### Compaction

» The operating system can reorganize the fragmented space so that the free space is contiguous.

#### » Expensive

On a 16-GB machine that can copy 8 bytes in 9 sec, it would take 16 sec to compact all of the memory

- » How much should the OS allocate?
  - » If processes can't grow, then allocate exactly the amount needed
  - » If the process can grow, and there is a hole next to the process, it can be allocated and the process grow into it.

- » How much should the OS allocate?
  - » If processes doesn't have an adjacent hole, it will have to move to a hole in memory large enough or
  - » Swap out one or more processes.
  - » If it can't grow and no free swap space, suspend or kill the process

- » Good idea to allocate a little extra when a process is swapped in to reduce the overheard with moving or swapping processes.
- » Only memory used should be swapped, otherwise wasteful.



## Managing Free Memory

- » Two ways to track memory usage
  - » Bit maps
  - » Free lists

» Chapter 10: Linux memory allocators such as buddy and slab.

#### Bitmap



- » With bitmap, memory divided into allocation units as small as a few words to a couple kilobytes.
- » Each allocation unit corresponds to a bit

#### Bitmap



» Problem with bitmap: When bringing in K-units, the memory manager must search the bitmap for a run of k consecutive 0's

#### Bitmap



» Problem with bitmap: When bringing in K-units, the memory manager must search the bitmap for a run of k consecutive 0's

#### Linked List



» Linked list: list of allocated and free memory segments.

#### Linked List



#### Allocating Memory

Heap Space

Heap Header

MPT

64 bytes free

12 bytes chunk

32 bytes free

12 bytes chunk

Heap Terminator

Example: We want to allocate a new 16 byte block.

Which free block does the OS choose?

#### First Fit

Heap Space

. .

Heap Space

Heap Header

MPT

New chunk allocated

Heap Header

MPT

16 bytes chunk

48 bytes free

12 bytes chunk

32 bytes free

12 bytes chunk

Heap Terminator

64 bytes free

12 bytes chunk

32 bytes free

12 bytes chunk

Heap Terminator

#### Best Fit

New chunk

allocated

Heap Space

Heap Header

MPT

64 bytes free

12 bytes chunk

32 bytes free

12 bytes chunk

Heap Terminator

Heap Space

Heap Header

MPT

64 bytes free

12 bytes chunk

16 bytes chunk

16 bytes free

12 bytes chunk

Heap Terminator

#### Worst Fit

Heap Space

Heap Header

MPT

New chunk allocated

Heap Space

Heap Header

MPT

16 bytes chunk

48 bytes free

12 bytes chunk

32 bytes free

12 bytes chunk

Heap Terminator

64 bytes free

12 bytes chunk

32 bytes free

12 bytes chunk

Heap Terminator

## Quick Fit



Maintain list for common sizes

#### Memory Allocation

- » Can be optimized by keeping separate lists for processes and holes.
- » If separate lists kept then the hole list should be sorted by size

# What if the program size exceeds physical RAM?

# Overlays

- » Memory wasn't always cheap. Could be a large part of total system price.
- » How do you write large programs for small memories?
- » init, main, finalize common programming pattern
  - Don't have to be in memory at the same time.

# Overlays

- » The programmer specifies the overlays
- » An overlay can not make calls into another overlay.
- » Actual loading of overlays into memory would be done by the runtime library

# Program Overlay Layout



# We need help, but why?

- » Multiprocessing causes external fragmentation
- » Compaction wastes resources
- » Solution break RAM into fixed parts
- » Do not need to be contiguous
- » Map from logical to physical spaces
- » Need hardware help

# Paging

- » Logical address space of a process can be noncontiguous; process is allocated physical memory whenever the latter is available
- » Divide physical memory into fixed-sized blocks called frames (size is power of 2, between 512 bytes and 8,192 bytes)
  - Commonly 4 KB
- » Divide logical memory into blocks of same size called pages
- » Keep track of all free frames
- » To run a program of size n pages, need to find n free frames and load program
- » Set up a page table to translate logical to physical addresses
  - Holds the map to the physical address
- » Internal fragmentation

# Paging Hardware



- » Address generated by CPU is divided into:
  - Page number (p) used as an index into a page table which contains base address of each page in physical memory
  - Page offset (d) combined with base address to define the physical memory address that is sent to the memory unit

- For given logical address space 2<sup>m</sup> and page size 2<sup>n</sup>

| page number | page offset |
|-------------|-------------|
| p           | d           |
| m - n       | n           |

- For given logical address space 2<sup>m</sup> and page size 2<sup>n</sup>



- For a system with 4 GB of addressable logical address space and a page size of 4 KB, the logical address is:

| page number | page offset |
|-------------|-------------|
| p           | d           |
| m - n       | n           |

Page size:  $4096 = 2^{12}$ 

Address space:  $4294967296 = 2^{32}$ 

| page number | page offset |
|-------------|-------------|
| 20          | 12          |

- Number of pages:

» And to verify our numbers:

$$1048576 = 2^{20}$$

| page number | page offset |
|-------------|-------------|
| 20          | 12          |

For a system with 12 bit addressing:

1. What is the maximum size of addressable real memory?

If the addresses are split into a page (6 bits) and an offset (displacement, 6 bits)

- 1. How large is a memory frame?
- 2. How many entries (maximum) is the page table?

For a system with 12 bit addressing:

1. What is the maximum size of addressable real memory?

 $2^{12} = 4096$  bytes

If the addresses are split into a page (6 bits) and an offset (6 bits)

1. How large is a memory frame?

2. How many entries (maximum) is the page table?

$$2^6 = 64 \text{ pages}$$

# Implementation of Page Tables

- » In this scheme every data/instruction access requires two memory accesses. One for the page table and one for the data/instruction.
  - Half speed

» The two memory access problem can be solved by the use of a special fast-lookup hardware cache called associative memory or translation look-aside buffers (TLBs)

#### Translation Lookaside Buffer



- All entries search in parallel
- Complex circuit so it's small
- Rarely over 1000 entries

#### Inside the TLB



# Implementation of Page Tables

- » Nehalem has a two-level TLB:
  - » The first level of TLB is shared between data and instructions.
    - » The level 1 data TLB now stores 64 entries for small pages (4K) or 32 for large pages (2M/4M),
    - » The level 1 instruction TLB stores 128 entries for small pages (the same as with Core 2) and seven for large pages.
    - » The second level is a unified cache that can store up to 512 entries and operates only with small pages.

#### Effective Access Time

- »Associative Lookup =  $\varepsilon$  time unit
- »Assume memory cycle time = M
- »Hit ratio percentage of times that a page number is found in the associative registers; ratio related to number of associative registers
- »Hit ratio =  $\alpha$

#### **Effective Access Time (EAT)**

$$EAT = (M + \varepsilon) \alpha + (2M)(1 - \alpha)$$



2M because a miss requires us to get the frame number out of the page table (M cost) and get the normal memory reference (M cost)

#### Effective Access Time Example

```
»Associative Lookup = \epsilon = 5 nanoseconds
»Assume memory cycle time is 100 nanoseconds
»Hit ratio = \alpha = 80%
```

**Effective Access Time (EAT)** 

EAT = 
$$(100 + 5) .8 + (200)(1 - .8)$$
  
= 124 nanoseconds

#### Effective Access Time

- » First equation assumed TLB and Page Table lookups happened in parallel.
- » If not:

#### **Effective Access Time (EAT)**

$$\mathrm{EAT} = (\mathrm{M} + \epsilon) \ \alpha + (2\mathrm{M} + \epsilon)(1 - \alpha)$$

#### Context Switches

- » Most systems TLB doesn't concern itself with which process is running.
  - Context switch means TLB must be flushed.
  - First few memory accesses of the new process will not get any TLB hits so the process will run at half speed for memory references.
    - Threads vs. processes
- » Some TLBs store address-space identifiers (ASIDs) in each TLB entry – uniquely identifies each process to provide address-space protection for that process

- » Our previous methods had one requirement:
  - The instructions being executed must be in physical memory.
  - The first approach to meeting this requirement is to place the entire logical address space in physical memory.
  - Dynamic loading can help to ease this restriction
    - Requires special precautions and extra work by the programmer.

- » Up until now we have assumed all programs fit fully into memory before they execute.
- » Virtual memory is a technique that allows the execution of processes that are not completely in memory.
- » One major advantage of this scheme is that programs can be larger than physical memory.

- » The requirement that instructions must be in physical memory to be executed seems both necessary and reasonable;
  - Also unfortunate, since it limits the size of a program to the size of physical memory.
- » In fact, an examination of real programs shows us that, in many cases, the entire program is not needed. For instance, consider the following:
  - Programs often have code to handle unusual error conditions. Since these errors seldom, if ever, occur in practice, this code is almost never executed.
  - Arrays, lists, and tables are often allocated more memory than they actually need. An array may be declared 100 by 100 elements, even though it is seldom larger than 10 by 10 elements. An assembler symbol table may have room for 3,000 symbols, although the average program has less than 200 symbols.
  - Certain options and features of a program may be used rarely

- » Even in those cases where the entire program is needed, it may not all be needed at the same time.
  - Locality of Reference Only a few pages of memory are accessed by the process in any given time slot.

#### Locality In A Memory-Reference Pattern



- » The ability to execute a program that is only partially in memory would confer many benefits:
  - A program would no longer be constrained by the amount of physical memory that is available. Users would be able to write programs for an extremely large virtual address space, simplifying the programming task.
  - Because each user program could take less physical memory,
     more programs could be run at the sance time, with a
     corresponding increase in CPU utilization and throughput but
     with no increase in response time or turnaround time.
  - Less I/O would be needed to load or swap user programs into memory, so each user program would run faster.

# Virtual Memory



- » Bring a page into memory only when it is needed
  - Less I/O needed
  - Less memory needed
  - Faster response
  - More users

- » Page is needed ⇒ reference to it
- $\Rightarrow$  invalid reference  $\Rightarrow$  abort
- $\Rightarrow$  not-in-memory  $\Rightarrow$  bring to memory
- » Lazy loading never swaps a page into memory unless page will be needed
- » Swapper that deals with pages is a pager



- which with each page table entry a valid—invalid bit is associated  $(v \Rightarrow in\text{-memory}, i \Rightarrow not\text{-in-memory})$ 
  - » Present / Absent in book parlance
- » Initially valid—invalid bit is set to i on all entries
- » Example of a page table snapshot:





- » If there is a reference to a page, first reference to that page will trap to operating system:
- » page fault
- 1. Operating system looks at another table to decide:
  - 1. Invalid reference  $\Rightarrow$  abort
  - 2. Just not in memory
- 2. Get empty frame
- 3. Swap page into frame
- 4. Reset tables
- 5. Set validation bit = v



# Page Table Entry



# Types of Page Faults

- » Soft miss: 10-20 instructions
- » Hard miss: several milliseconds
- » Minor page fault
- » Major page fault
- » Segmentation fault

# Multilevel Page Tables

- » Single page table contains one entry per virtual page per process
  - **»** 4 MB per process.  $2^{32} / 4096 * 4$
- » Multilevel page tables avoid keeping all page tables in memory at a time.

# Multilevel Page Tables



#### Multilevel Page Tables

- » 80386 addressed 4 GB of memory using a two-level page table.
  - » Page directory -> page tables -> frames
- » Pentium Pro
  - » Page directory pointer table
    - » 64 bit so it could address over 4GB
    - » Page Map Level 4
    - $> 2^{48} = 256 \text{ TB}$

- » First used on PowerPC, UltraSPARC and Itanium
- » 64bit page tables are large
  - » 4MB page and 64 bit virtual addresses mean 2<sup>42</sup> page table entries
- » Instead of an entry per page of virtual address pace, one entry per page frame of real memory



- » 64-bit virtual addresses and 4 KB page size with 4 GB of RAM, inverted page table requires on 1,048,476 entries.
- » Each entry tracks (virtual address, process) pair
- » Saves a lots of space, but big down side
  - » Virtual -> Physical mapping much harder
  - » Every memory reference must search the page table not just on page faults

- » TLB can help by holding heavily used pages.
- » On TLB miss, still have to search the inverted table
  - » Feasible way to accomplish is to hash on the virtual address.
  - » If the hash table has as many slots as the machine has physical pages, the bucket will hold a single frame

# Linux Page Table

#### virtual address unused pgd index pmd index offset pte index PTE dir. phys. mem. pmd pgd page-table pointer